Section: New Results

Scalable Data Analysis

Scalable Mining of Small Visual Objects

Participants: Pierre Letessier, Julien Champ, Alexis Joly.

Automatically linking multimedia documents that contain one or several instances of the same visual object has many applications, including salient event detection, relevant pattern discovery in scientific data, or simply web browsing through hyper-visual links. Whereas efficient methods now exist for searching rigid objects in large collections, discovering them from scratch is still challenging in terms of scalability, particularly when the targeted objects are small compared to the whole image. In a previous work, we formally revisited the problem of mining or discovering such objects, and then generalized two kinds of existing methods for probing candidate object seeds: weighted adaptive sampling and hashing-based methods. This year, we continued working on the subject by improving our high-dimensional data hashing strategy, which works first at the visual level and then at the geometric level. We conducted new experiments on a dedicated evaluation dataset (http://www-sop.inria.fr/members/Alexis.Joly/BelgaLogos/FlickrBelgaLogos.html) and showed that the recall of our approach clearly outperforms that of the reference method [46].
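
As an illustration of the hashing-based seed probing idea, the following minimal Python sketch hashes local descriptors into buckets and keeps the buckets shared by several images as candidate object seeds. It uses generic random projections as a stand-in for our data-dependent hashing, and all names (`lsh_codes`, `probe_seeds`, `min_images`) are illustrative rather than taken from the actual implementation:

```python
import numpy as np

def lsh_codes(descriptors, n_bits=16, seed=0):
    """Hash local descriptors to short binary codes with random
    projections (a generic stand-in for data-dependent hashing)."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(descriptors.shape[1], n_bits))
    bits = (descriptors @ proj) > 0
    # Pack each row of bits into a single integer bucket key.
    return bits @ (1 << np.arange(n_bits))

def probe_seeds(descriptors, image_ids, n_bits=16, min_images=2):
    """Group descriptors colliding in the same hash bucket across
    different images; every multi-image bucket is a candidate object
    seed, to be verified afterwards at the geometric level."""
    keys = lsh_codes(descriptors, n_bits)
    buckets = {}
    for key, img in zip(keys, image_ids):
        buckets.setdefault(int(key), set()).add(img)
    return {k: v for k, v in buckets.items() if len(v) >= min_images}
```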

Based on this contribution, we then addressed the problem of suggesting object-based visual queries in a multimedia search engine [22], [36]. State-of-the-art visual search systems are usually based on the query-by-window paradigm: a user selects any image region containing an object of interest and the system returns a ranked list of images that are likely to contain other instances of the query object. Users' perception of these tools is however affected by the fact that many submitted queries return nothing or only junk results (complex non-rigid objects, higher-level visual concepts, etc.). In [22], we addressed the problem of suggesting only the object queries that actually have relevant matches in the dataset. This requires first discovering accurate object clusters in the dataset (an offline process) and then selecting the most relevant objects according to the user's intent (an online process). We therefore introduced a new object-instance clustering framework based on a bipartite shared-neighbours clustering algorithm, which is used to gather the object seeds discovered by our visual mining method. Shared-nearest-neighbours methods had not been studied beforehand in the case of bipartite graphs and had never been used in the context of object discovery. Experiments show that this new method outperforms state-of-the-art object mining and retrieval results on the Oxford Buildings dataset. We finally describe two real-world object-based visual query suggestion scenarios using the proposed framework and show examples of suggested object queries. A demo was presented at ACM Multimedia 2013 [36].
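
The bipartite shared-neighbours clustering can be pictured with the following simplified sketch: each seed is linked to the set of images it matches, two seeds are connected when they share at least `min_shared` images, and clusters are the connected components of the resulting seed graph. The threshold and the reduction to connected components are illustrative simplifications of the algorithm in [22]:

```python
from itertools import combinations

def snn_cluster_seeds(seed_to_images, min_shared=3):
    """Cluster object seeds on a bipartite seed/image graph with a
    union-find over pairs of seeds sharing enough matched images."""
    parent = {s: s for s in seed_to_images}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in combinations(seed_to_images, 2):
        if len(seed_to_images[a] & seed_to_images[b]) >= min_shared:
            union(a, b)

    clusters = {}
    for s in seed_to_images:
        clusters.setdefault(find(s), []).append(s)
    return list(clusters.values())
```

For instance, `snn_cluster_seeds({"s1": {1, 2, 3}, "s2": {2, 3, 4}, "s3": {9}}, min_shared=2)` groups `s1` and `s2` into one cluster and leaves `s3` alone.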

This method was finally integrated within a visual-based media event detection system in the scope of a French project called the Transmedia Observatory [33]. It allows the automatic discovery of the most circulated images across the main news media (news websites, press agencies, TV news and newspapers). The main originality of the detection is that it relies on transmedia contextual information to denoise the raw visual detections and consequently focus on the most salient transmedia events.

Rare Events Identification for Large-Scale Applications

Participant: Florent Masseglia.

While significant work in data mining has been dedicated to the detection of single outliers in the data, less research has approached the problem of isolating a group of outliers, i.e. rare events representing micro-clusters of less – or significantly less – than 1% of the whole dataset. This research issue is critical, for example, in medical applications. The problem is difficult to handle as it lies at the frontier between outlier detection and clustering, and is distinguished by the clear challenge of avoiding missed true positives. In [41], we address this challenge and propose a novel two-stage framework, based on a backward approach, to isolate abnormal groups of events in large datasets. The key of our backward approach is to first identify the cores of the dense regions and then gradually augment them based on a density-driven condition. The framework outputs a small subset of the dataset containing both rare events and outliers. We tested our framework on a biomedical application to find micro-clusters of pathological cells. The comparison against two common clustering (DBSCAN) and outlier detection (LOF) algorithms shows that our approach is a very efficient alternative for the detection of rare events – generally achieving a recall of 100% and a higher precision, positively correlated with the size of the rare event – while also providing an 𝒪(N) solution, compared to existing algorithms dominated by 𝒪(N²) complexity.
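
A minimal sketch of the backward idea, assuming a k-NN-based density estimate: points with a small k-NN radius form the cores of dense regions, the cores are grown as long as adjoining points remain comparably dense, and the leftover points form the rare subset. The parameters (`k`, `core_quantile`, `growth_ratio`) are illustrative, and this naive growth loop does not reproduce the 𝒪(N) behaviour of the published framework:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def isolate_rare_events(X, k=20, core_quantile=0.5, growth_ratio=1.5):
    """Backward approach (sketch): identify dense cores, grow them with
    a density-driven condition, return the indices of the leftovers."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)            # first column is the point itself
    radius = dist[:, -1]                    # k-NN radius: small = dense
    grown = radius <= np.quantile(radius, core_quantile)

    changed = True
    while changed:
        changed = False
        for i in np.where(~grown)[0]:
            neigh, in_region = idx[i, 1:], grown[idx[i, 1:]]
            # Density-driven condition: attach a point touching the grown
            # region if it is not much sparser than its neighbours there.
            if in_region.any() and radius[i] <= growth_ratio * radius[neigh][in_region].min():
                grown[i] = True
                changed = True
    return np.where(~grown)[0]              # rare events and outliers
```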

Large-scale content-based plant identification from social image data

Participants: Hervé Goëau, Alexis Joly, Julien Champ, Saloua Litayem.

Speeding up the collection and integration of raw botanical observation data is a crucial step towards sustainable development of agriculture and the conservation of biodiversity. Initiated in the context of a citizen science project in collaboration with the botanists of the AMAP UMR team and the Tela Botanica social network, the overall contribution of this work [23] is an innovative collaborative workflow focused on image-based plant identification as a means to enlist new contributors and facilitate access to botanical data. Since 2010, hundreds of thousands of geo-tagged and dated plant photographs have been collected and revised by hundreds of novice, amateur and expert botanists of a specialized social network. An image-based identification tool – available as both a web and a mobile application – is synchronized with this growing data and allows any user to query or enrich the system with new observations. Extensive experiments with the visual search engine, as well as system-oriented and user-oriented evaluations of the application, showed that it is very helpful for determining a plant among hundreds or thousands of species [23]. As a concrete result, more than 80K people in about 150 countries have downloaded the iPhone application [32].

From a data management and data analysis perspective, our main contribution concerns the scalability of the system. At the time of writing, the content-based search engine works on 120K images covering more than 5,000 species (already making it the largest identification tool of its kind to date). The resulting training dataset contains several hundred million feature vectors, each with several hundred float attributes (i.e. high-dimensional feature vectors describing the visual content). At query time, thousands of such feature vectors are extracted from the query pictures and have to be searched online in the training set to find the most similar pictures. The underlying approximate nearest-neighbour search is sped up thanks to a data-dependent high-dimensional hashing framework based on Random Maximum Margin Hashing (RMMH), a new hash function family that we introduced in 2011. RMMH is used both for compressing the original feature vectors into compact binary hash codes and for partitioning the data into a well-balanced hash table. Search is then performed through adaptive multi-probe accesses in the hash table and a top-k search refinement step on the full binary hash codes. The latest improvements, brought in 2013, include a multi-threaded version of the search, the use of a probabilistic asymmetric distance instead of the Hamming distance, and the integration of a query optimization training stage in the compressed feature space instead of the original space. A beta version of the Pl@ntNet visual search engine based on these new contributions is currently being tested and is about 8 times faster than the one used in production.
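
For intuition, here is a hedged sketch of the RMMH hash function family: each bit is produced by a max-margin hyperplane (a linear SVM) trained on a small random sample of the data with randomly assigned, balanced +1/-1 labels, which yields balanced, data-dependent binary codes. The sample size `m` and the SVM regularization below are illustrative choices, not the production settings:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_rmmh(X, n_bits=32, m=32, seed=0):
    """Train one max-margin hyperplane per hash bit on a small random
    sample of X carrying random balanced labels (RMMH sketch)."""
    rng = np.random.default_rng(seed)
    hyperplanes = []
    for _ in range(n_bits):
        sample = rng.choice(len(X), size=m, replace=False)
        labels = np.array([1] * (m // 2) + [-1] * (m - m // 2))
        rng.shuffle(labels)
        svm = LinearSVC(C=1.0).fit(X[sample], labels)
        hyperplanes.append((svm.coef_[0], svm.intercept_[0]))
    return hyperplanes

def hash_codes(X, hyperplanes):
    """Compress feature vectors into compact binary codes."""
    W = np.stack([w for w, _ in hyperplanes])    # (n_bits, dim)
    b = np.array([c for _, c in hyperplanes])    # (n_bits,)
    return (X @ W.T + b) > 0                     # boolean code matrix
```

The resulting codes can serve both purposes mentioned above: the full codes for the top-k refinement step, and a few leading bits as keys of the balanced hash table partitions.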

Besides scalability and efficiency, we also worked on improving the identification performance of the system [29]. We notably improved the quality of the top-K returned images by weighting each match according to its Hamming distance to the query rather than using a simple vote. We then improved the multi-cue fusion strategy by indexing each type of visual feature separately rather than concatenating them in an early phase. We finally trained the optimal selection of features for each of the considered plant organs (flower, leaf, bark, fruit). Beyond the use of the visual content itself, we explored the usefulness of the associated metadata and showed that some of it, such as the date, can improve the identification performance (contrary to the geo-coordinates, which surprisingly degraded the results). Overall, as a result of our participation in the ImageCLEF plant identification benchmark [34], we obtained the second best run among 12 international groups and a total of 33 submitted runs.
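
The distance-weighted voting improvement can be sketched as follows, with an exponential weighting and bandwidth `sigma` chosen purely for illustration (the actual weighting used in [29] may differ):

```python
import numpy as np
from collections import defaultdict

def score_images(matches, sigma=8.0):
    """Distance-weighted voting (sketch): each matched feature votes for
    its source image with a weight that decays with its Hamming distance
    to the query feature, instead of a flat +1 vote."""
    scores = defaultdict(float)
    for image_id, hamming in matches:      # (image_id, distance) pairs
        scores[image_id] += np.exp(-hamming / sigma)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Late fusion then amounts to running such a scoring pass per feature type and combining the per-type scores for the organ at hand, rather than concatenating the features before indexing.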